A deep tensor-based approach for automatic depression recognition from speech utterances

Depression is one of the significant mental health issues affecting all age groups globally. While it has been widely recognized to be one of the major disease burdens in populations, complexities in definitive diagnosis present a major challenge. Usually, trained psychologists utilize conventional methods including individualized interview assessment and manually administered PHQ-8 scoring. However, heterogeneity in symptomatic presentations, which span somatic to affective complaints, imparts substantial subjectivity to its diagnosis. Diagnostic difficulty is further compounded by the cross-sectional nature of sporadic assessment during physician-office visits, especially since depressive symptoms and their severity may evolve over time. With widespread acceptance of smart wearable devices and smartphones, passive monitoring of depression traits using behavioral signals such as speech presents a unique opportunity for companion diagnostics that assist trained clinicians in objective assessment over time. Therefore, we propose a framework for automated depression classification leveraging alterations in speech patterns in the well documented and extensively studied DAIC-WOZ depression dataset. This novel tensor-based approach requires a substantially simpler implementation architecture and extracts discriminative features for depression recognition with a high F1 score and accuracy. We posit that such algorithms, which impose a significantly lower compute load, would allow effective onboard deployment in wearables to improve diagnostic accuracy and enable real-time monitoring of depressive disorders.


Introduction
Depression is a mental health issue often characterized by low mood, sadness, negative thoughts, and loss of interest in day-to-day activities, and is often associated with an individual's [...] interaction involvement and articulation rate were higher in healthy controls. Another interesting study by Long et al. [16] examined several speech types (read speech, interviews, and picture description) and emotion types (positive, negative, and neutral) for their power to discriminate depressed from normal speech. Experiments on a dataset of 74 subjects using an SVM classifier demonstrated that interview speech and neutral emotion contribute more towards recognition of depression from speech than other speech and emotion types. The study in [17] introduced a new dataset, PRIORI, collected from everyday smartphone conversation recordings, and utilized it to study the change of emotional activation and valence in the depressed and manic phases of Bipolar Disorder. Furthermore, in an independent study, Cummins et al. [18] investigated the effect of speaker normalization on depression classification performance, as mental-health disorders are highly speaker-specific and the speakers in the depressed and healthy control groups were different. Feature normalization for reducing speaker variability was shown to improve recognition performance when MFCC and formant-based features were used. All these techniques relied on handcrafted features and traditional classifiers such as Gaussian Mixture Models (GMM) and Support Vector Machines (SVM), focusing on identifying a relevant feature set for robust classification of depressed speech from healthy controls.
Multimodal approaches using audio, text, and facial geometry features have also been investigated [19][20][21][22][23]. Alghowinem et al. investigated the fusion of information from speech, head pose, and eye gaze behaviors for depression/normal classification on a dataset of 30 depressed subjects and 30 healthy controls collected by the Black Dog Institute [19,24]. The authors leveraged different feature selection and fusion techniques, and found that t-test based feature selection performed well for binary depression/normal classification. Moreover, the performance of the individual modalities was also reported, with speech showing the maximum recognition accuracy of 83%, further strengthening the idea that speech alone contains sufficient information for robust depression recognition. In [20], new video and text features were proposed, and a hybrid of deep and shallow networks was used for depression classification using audio, video, and text modalities. The audio and video modalities were modelled using a DCNN-DNN based system, while the text modality was modelled using a Paragraph Vector (PV) based SVM system. Moreover, in [22], an LSTM based system was explored to simultaneously model depression from audio and text sequences without performing explicit topic modelling of the content of the interviews. Also addressing the AVEC 2016 depression sub-challenge, the work in [23] used an i-vector framework with MFCC features for audio modelling, combined in a late-fusion fashion with geometrical features and a polynomial parametrization of facial landmarks, for depression classification. From the recent literature on depression classification, it is evident that different combinations of modalities have been explored to build a robust system. However, another major observation that can be derived from such studies is the higher performance obtained with the audio modality, which serves as a motivating factor to further explore audio-based depression recognition.
With progress in the deep learning field and increased computational efficiency, the dependence on hand-crafted features has been reduced. Deep learning has facilitated efficient end-to-end modelling of complex paralinguistic phenomena that are difficult to assess using traditional techniques. Deep learning has been successfully applied to the automated diagnosis and modelling of conditions such as Bipolar Disorder [17], anxiety [25], Alzheimer's dementia [26], and clinical depression [27]. Much of the recent work has explored the use of time-frequency speech representations such as spectrograms and log-mel spectrograms as input to deep learning architectures that classify depression from audio. Srimadhur et al. [28] investigated spectrograms as well as raw waveforms as input to a CNN-based network on a subset of the DAIC-WOZ dataset in a speaker-dependent fashion. Moreover, in the study by Ma et al. [29], a CNN-LSTM based architecture was explored that extracted discriminative features from mel-spectrograms using 1D convolution in the first layer. A random sampling strategy was also proposed to mitigate the data imbalance issue associated with the DAIC-WOZ dataset. The prediction for an individual is obtained by majority voting over the labels of that individual's speech segments. In a recent study by Vazquez-Romero et al. [30], an ensemble of 1D-CNN networks is used with mel-spectrograms as input features. The label for an individual is generated from the mean of the segment-level probabilities for each constituent network in the ensemble, and the ensemble labels are averaged to yield a final label for the individual. This ensemble technique demonstrated appreciable improvements in recognition performance over SVM classification of hand-crafted features and over other single deep learning-based networks.
Multiple instance learning (MIL) is an apt choice when a single label is available for a group of utterances, as in the depression classification problem [31]. The majority of the approaches in the literature that exploit the MIL architecture work by generating labels for individual segments and averaging them to yield a final label for the whole utterance. This is done using a network whose parameters are shared across all the segments of an utterance [32,33]. However, the inherent problem with the MIL framework for depression classification is that not all the segments of the utterance exhibit depression-related characteristics, with the majority of the segments being in a neutral emotional state. As such, false labels are predicted quite often because the neutral-state segments dominate.
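The effect of a neutral majority on label pooling can be illustrated with a toy bag. The sketch below is not taken from any of the cited systems; the function names and the segment probabilities are hypothetical values chosen only to show how averaging can hide a few strongly depression-related segments, while max-pooling lets a single segment decide.

```python
# Toy illustration with made-up numbers: conventional MIL pooling of
# segment-level probabilities into one bag-level decision.

def mean_pool(probs):
    """Bag probability as the average of the segment probabilities."""
    return sum(probs) / len(probs)

def max_pool(probs):
    """Bag probability as the single most confident segment probability."""
    return max(probs)

# Hypothetical depressed speaker: two salient segments, six neutral ones.
segment_probs = [0.1, 0.2, 0.1, 0.9, 0.15, 0.1, 0.85, 0.2]

bag_mean = mean_pool(segment_probs)   # pulled down by the neutral majority
bag_max = max_pool(segment_probs)     # decided by one segment only

print(f"mean pooling: {bag_mean:.3f} -> label {int(bag_mean >= 0.5)}")
print(f"max pooling:  {bag_max:.3f} -> label {int(bag_max >= 0.5)}")
```

With these values, mean pooling assigns the bag to the non-depressed class even though two segments are strongly indicative, whereas max pooling would flip on any single spurious segment.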
Motivated by these limitations of the extant modelling methodologies, we developed a tensor-based approach to extract shared and discriminative features from multiple segments of an utterance. Tensor factorizations provide a natural method for analyzing common information spread across the modes of a tensor [34]. Utilizing this aspect, we use tensor factorization in conjunction with neural network-based learning to address multiple-instance learning in a novel framework. Furthermore, the utterance level tensor core generated by the feature extraction block is passed on to an attention mechanism to generate the utterance level attentive feature. Statistics pooling of the attentive representations is performed to extract bag-level features, which are classified using fully connected layers. This removes the dependence on average/max pooling of the output labels of individual segments for utterance level prediction, thus countering the inherent issue of traditional MIL frameworks. The proposed tensor-based MIL approach for depression classification outperforms several state-of-the-art methodologies and provides a promising avenue for robust depression classification from speech signals.

Tensor preliminaries
We review the introductory multilinear algebra necessary to understand Tucker decomposition. A detailed, comprehensive review of tensor algebra can be found in [34,35]. Following the notation used in the tensor literature, a vector is denoted by a lowercase letter (e.g. $a$), a matrix by an uppercase letter (e.g. $A$), and tensors of order three or more by calligraphic letters (e.g. $\mathcal{A}$).
Tensors are multidimensional arrays, e.g. $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, where $N$ is the number of modes of the tensor, also referred to as the order of the tensor. The modes may correspond to space, time, frequency, trials, utterances, etc., and $I_n$ specifies the dimensionality of the $n$th mode of the tensor $\mathcal{X}$. Tensor manipulation often requires reshaping the tensor into matrix form, and one such reshaping is called mode-$n$ matricization or unfolding. For a third order tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, mode-$n$ matricization is achieved by fixing one index and varying the other two. It is denoted by $X_{(n)} \in \mathbb{R}^{I_n \times (I_1 \times I_2 \times \cdots \times I_{n-1} \times I_{n+1} \times \cdots \times I_N)}$, where the column vectors of $X_{(n)}$ are the mode-$n$ vectors of $\mathcal{X}$. For $N$ matrices, one corresponding to each mode, we denote the $n$th matrix using a superscript in parentheses, for example $U^{(n)}$.
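As an illustration of mode-n matricization, the following NumPy sketch unfolds a small third order tensor along each of its modes. The helper name `unfold`, the zero-based mode index, and the column ordering (earliest remaining mode varies fastest) are assumptions made for this example; other column orderings are also in use and differ only by a permutation of columns.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n matricization: rows index `mode`, columns index all other modes.

    Fortran ordering makes the earliest remaining mode vary fastest, which is
    one common convention (an assumption here, not prescribed by the text).
    """
    return np.reshape(np.moveaxis(tensor, mode, 0),
                      (tensor.shape[mode], -1), order="F")

X = np.arange(24).reshape(2, 3, 4)     # third order tensor: I1=2, I2=3, I3=4
X0, X1, X2 = unfold(X, 0), unfold(X, 1), unfold(X, 2)
print(X0.shape, X1.shape, X2.shape)    # (2, 12) (3, 8) (4, 6)

# Every column of X1 is a mode-2 fiber of X (all indices fixed except the second).
print(X1[:, 0], X[0, :, 0])
```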
Mode-$n$ multiplication of a tensor $\mathcal{X}$ with a matrix $U$ is obtained by multiplying all the mode-$n$ vector fibers of the tensor with the matrix $U$. It is denoted as $\mathcal{Y} = \mathcal{X} \times_n U$, and in matrix form it can be written as

$$Y_{(n)} = U X_{(n)}.$$

A multilinear subspace requires the understanding of multilinear projections, as a tensor subspace is defined as a mapping from a high-dimensional space to a low-dimensional space [36]. Considering the general case of higher order tensors, an $N$th order tensor resides in the tensor space $\mathbb{R}^{I_1} \otimes \mathbb{R}^{I_2} \otimes \cdots \otimes \mathbb{R}^{I_N}$, where $\mathbb{R}^{I_1}, \mathbb{R}^{I_2}, \ldots, \mathbb{R}^{I_N}$ denote real vector spaces and $\otimes$ represents the tensor outer product (for details see [34]). As such, the tensor space for $N$th order tensors consists of the outer product of $N$ vector spaces $\mathbb{R}^{I_n}$, $n \in \{1, 2, \ldots, N\}$. A tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ can be projected onto a lower dimensional tensor $\mathcal{Y} \in \mathbb{R}^{P_1 \times P_2 \times \cdots \times P_N}$, where $P_n \le I_n$, using $N$ projection matrices $U^{(n)} \in \mathbb{R}^{I_n \times P_n}$, one corresponding to each mode of the tensor.
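The mode-n product and the multilinear projection can be checked numerically. The sketch below is a minimal NumPy illustration with random data; the helper names, the zero-based mode index, and the choice to apply each $I_n \times P_n$ projection matrix in transposed form are assumptions made for this example.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n matricization with the earliest remaining mode varying fastest."""
    return np.reshape(np.moveaxis(tensor, mode, 0),
                      (tensor.shape[mode], -1), order="F")

def mode_n_product(tensor, matrix, mode):
    """Multiply every mode-`mode` fiber of `tensor` by `matrix`."""
    # tensordot contracts the matrix columns with the chosen mode and puts the
    # new axis first; moveaxis returns it to its original position.
    out = np.tensordot(matrix, tensor, axes=(1, mode))
    return np.moveaxis(out, 0, mode)

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 5, 4))            # I1 x I2 x I3
U = rng.standard_normal((3, 5))               # maps second-mode fibers R^5 -> R^3

Y = mode_n_product(X, U, mode=1)              # shape (6, 3, 4)
matrix_form_ok = np.allclose(unfold(Y, 1), U @ unfold(X, 1))   # Y_(n) = U X_(n)

# Multilinear projection: one I_n x P_n matrix per mode, applied transposed.
P = (4, 3, 2)
Us = [rng.standard_normal((I, p)) for I, p in zip(X.shape, P)]
Z = X
for n, U_n in enumerate(Us):
    Z = mode_n_product(Z, U_n.T, n)

print(Y.shape, Z.shape, matrix_form_ok)
```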

Tucker decomposition.
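Tucker decomposition of a third order tensor $\mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$ is defined as a multilinear transformation of a core tensor $\mathcal{X} \in \mathbb{R}^{P_1 \times P_2 \times P_3}$, generally of small size and dense, by the factor matrices corresponding to each mode of the tensor [34,37]:

$$\mathcal{Y} = \mathcal{X} \times_1 U^{(1)} \times_2 U^{(2)} \times_3 U^{(3)}.$$

Here, $U^{(1)} \in \mathbb{R}^{I_1 \times P_1}$, $U^{(2)} \in \mathbb{R}^{I_2 \times P_2}$ and $U^{(3)} \in \mathbb{R}^{I_3 \times P_3}$ correspond to the subspaces along mode-1, mode-2 and mode-3, respectively. The subspaces consist of the basis vectors obtained from matrix unfolding along each mode of the tensor. Tucker decomposition has the constraint of orthogonality and ordering on the core tensor and factor matrices, while other constraints such as non-negativity, sparsity, etc. can also be imposed.
A matrix representation of the Tucker decomposition, in the general case, can be achieved by matricizing $\mathcal{Y}$ and $\mathcal{X}$ as [38]

$$Y_{(n)} = U^{(n)} X_{(n)} \left( U^{(N)} \otimes \cdots \otimes U^{(n+1)} \otimes U^{(n-1)} \otimes \cdots \otimes U^{(1)} \right)^{\top},$$

where $\otimes$ denotes the Kronecker product. The decomposition can also be written as a linear combination of $\prod_{n=1}^{N} P_n$ rank-one tensors, one for each entry of the core tensor.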
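The three equivalent views of a Tucker model (successive mode products, the Kronecker-structured matricized form, and the sum of rank-one terms) can be verified on random data. The sketch below is only an illustration: the sizes are arbitrary, the helper names are hypothetical, and the Kronecker ordering shown is tied to the unfolding convention assumed in `unfold` (earliest remaining mode varies fastest).

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n matricization with the earliest remaining mode varying fastest."""
    return np.reshape(np.moveaxis(tensor, mode, 0),
                      (tensor.shape[mode], -1), order="F")

def mode_n_product(tensor, matrix, mode):
    """Multiply every mode-`mode` fiber of `tensor` by `matrix`."""
    out = np.tensordot(matrix, tensor, axes=(1, mode))
    return np.moveaxis(out, 0, mode)

rng = np.random.default_rng(1)
core = rng.standard_normal((2, 3, 4))                       # P1 x P2 x P3
U1, U2, U3 = [rng.standard_normal((I, P))                   # I_n x P_n factors
              for I, P in zip((5, 6, 7), core.shape)]

# View 1: core multiplied by one factor matrix along each mode.
Y = core
for n, U in enumerate((U1, U2, U3)):
    Y = mode_n_product(Y, U, n)

# View 2: matricized form along the first mode, Y_(1) = U1 X_(1) (U3 kron U2)^T.
Y1 = U1 @ unfold(core, 0) @ np.kron(U3, U2).T
kron_ok = np.allclose(unfold(Y, 0), Y1)

# View 3: weighted sum of rank-one tensors, one term per core entry.
Y_sum = np.einsum("pqr,ip,jq,kr->ijk", core, U1, U2, U3)
sum_ok = np.allclose(Y, Y_sum)
n_terms = core.size                                          # 2 * 3 * 4 = 24

print(Y.shape, kron_ok, sum_ok, n_terms)
```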

Dataset and preprocessing
For the task of depression classification from speech signals, we use the audio modality from the Distress Analysis Interview Corpus-Wizard of Oz (DAIC-WOZ), which is a subset of the larger corpus DAIC [39] and was introduced in the Audio/Visual Emotion Challenge (AVEC) 2016 [40]. The dataset consists of clinical interviews conducted between a participant and a virtual interviewer, Ellie, which was controlled remotely by a human interviewer. The dataset was collected with the aim of augmenting the diagnosis of psychological conditions such as stress, anxiety, and depression through automatic computer applications based on verbal and non-verbal indicators. It consists of audio, facial geometry features, and text transcriptions of the interviews. Table 1 shows the distribution of participants according to gender for the train, validation and test partitions. The dataset is recorded in English from a population of 189 subjects comprising 146 depressed subjects and 43 healthy controls. The duration of the audio ranges from 7 to 33 min (average 16 min). Each participant's audio file has been given a PHQ-8 score by the psychologist, which denotes the severity of depression, with 0 being no depression to 22 being severely depressed. A binary PHQ-8 score is also provided, which classifies participants as depressed/not-depressed. Furthermore, the train-development-test split provided by the AVEC 2016 challenge divides the dataset into partitions comprising 118, 24, and 47 participants in the train, development, and test sets, respectively.
Since the virtual interviewer's speech is not a part of the analysis, a silence-region-based segmentation technique from the Python library pyAudioAnalysis [41] is employed to segment out the participant's speech and discard the virtual interviewer's speech segments, as they do not contain any emotion information. The resulting speech segments are of different durations, whereas deep learning techniques such as CNN and TFNN [42] require equal-length input, so the speech segments are either zero-padded or truncated to a duration of 7 s. The sampling rate of the speech signal is 16 kHz.
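The length normalization step amounts to fixing every segment at 7 s × 16 kHz = 112,000 samples. A minimal sketch follows; the function name `fix_length` is hypothetical, and padding at the end of the segment (rather than at the start or symmetrically) is an assumption, since the text does not specify where the zeros are placed.

```python
SAMPLE_RATE = 16_000                          # Hz, as stated in the text
TARGET_SECONDS = 7                            # fixed segment duration
TARGET_LEN = SAMPLE_RATE * TARGET_SECONDS     # 112000 samples

def fix_length(samples, target_len=TARGET_LEN):
    """Truncate, or zero-pad at the end, so the segment has `target_len` samples."""
    if len(samples) >= target_len:
        return list(samples[:target_len])
    return list(samples) + [0.0] * (target_len - len(samples))

short_segment = [0.5] * (3 * SAMPLE_RATE)     # 3 s of dummy audio -> padded
long_segment = [0.5] * (10 * SAMPLE_RATE)     # 10 s of dummy audio -> truncated

padded = fix_length(short_segment)
truncated = fix_length(long_segment)
print(len(padded), len(truncated))            # both 112000
```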

Methodology
This section discusses the Tensor Factorization-based Multiple-Instance Learning Technique, which is used for the classification of depression versus normal speech from multiple utterances of a single speaker. Furthermore, an utterance level attention followed by a statistics pooling layer [43] is employed to extract temporal features in the subsequent layers of the network. Moreover, a standard Multiple-Instance Learning (MIL) network based on Convolution layers is also discussed, which serves as a baseline for comparing results.

CNN and 2D TFNN based MIL framework.
Multiple Instance Learning with CNN as a base architecture has been explored in many previous works [44,45]. As such, we have used this architecture as a baseline in our work. The base CNN architecture comprises three feature learning blocks followed by vectorization of the deep features and classification using a sigmoid layer. Each feature learning block comprises a 2D convolution layer, a batch normalization layer, an activation layer, and a max-pooling layer. The convolution layer extracts local features with the help of trainable kernels. Batch normalization forces the features over the entire batch to have zero mean and unit variance. The normalized features are passed through an activation function (ELU in our work). Finally, a max-pooling layer is employed to reduce the size of the feature maps obtained, keeping only the relevant information. Given a bag of utterances belonging to a speaker, the base CNN architecture is applied to each of the utterances to yield a label for each utterance. A global max-pooling of the labels yields the final label for the bag of utterances. A 2D TFNN architecture [46] is also employed as a base network for MIL, similar to the CNN architecture. The 2D TFNN base receives mel-spectrograms extracted from speech utterances as input. The factor matrices corresponding to the time and frequency modes extract the core feature tensor from the input tensors. Four consecutive Tensor FF layers yield the final feature tensor, which is then used to generate a class probability through an inner product with a weight matrix of the same dimensions as the feature tensor. Fig 1 shows the end-to-end system for the 2D CNN and 2D TFNN based MIL architectures.

2.3.2 3D TFNN architecture as feature extractor for MIL.
The 3D TFNN architecture was introduced in [46] for emotion recognition from speech. The 3D TFNN serves as a natural framework for Multiple Instance Learning, as the core idea of Tensor Factorization is capturing the shared information across different modes of a tensor. As such, given a bag of utterances belonging to a speaker, the utterances are first converted to 2D speech representations such as mel-spectrograms of dimensions $I_{freq} \times I_{time}$. The mel-spectrograms for each utterance are stacked along the third dimension to form a 3D tensor of dimensions $I_{freq} \times I_{time} \times I_{utter}$ representing the bag of utterances. The 3D tensor is passed through successive Tensor Factorization layers to obtain the deep feature tensors. Finally, a tensor sigmoid layer, comprising a weight tensor of the same size as the deep feature tensor, is utilized to get the probability for the bag of utterances.
The 3D TFNN architecture for Multiple Instance Learning benefits from not repeating the same architecture individually on each utterance as in conventional CNN-based MIL systems. Moreover, the probability generated by the 3D TFNN represents the entire bag as opposed to CNN-based MIL, where a global max-pooling of the labels generates a bag-level label. This comes from the inherent capability of Tensor Factorization-based feature extraction. The shared information across mel-spectrograms of utterances for an individual is utilized to conclude the label for that particular speaker. In contrast, the utterance level information is independent in conventional MIL systems, and no shared information across utterances is utilized. Fig 2 shows the proposed end-to-end Tensor factorization based approach for MIL.

3D TFNN with utterance level attention.
In this technique, the 3D TFNN described in 2.3.2 is utilized to extract deep tensor features from 3D tensor representations of bags of utterances. The feature tensor now comprises utterance level representations stacked along the third dimension of the feature core tensor. For each 2D slice of the 3D feature tensor, an utterance level attentive feature representation is generated using the following attention mechanism.

Attention layer.
The attention layer used in our work is based on the attention proposed in [47]. The attention layer takes in a sequence of high-level feature vectors, focuses on the depression-related parts by means of attention weights, and generates an utterance level attention feature vector representing the depression-related frames of the input sequence. Given a 2D slice $H \in \mathbb{R}^{I_2 \times I_3}$ of the 3D feature tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, where $I_1$, $I_2$, $I_3$ represent the number of utterances, the number of mel filter bands and the number of frames respectively, normalized attention weights $\alpha_t$ are first computed using a softmax function over the frames, where $t \in \{1, 2, \ldots, T\}$, $T$ being the total number of frames in the feature tensor slice and $h_t$ being the feature vector belonging to the $t$th frame. The utterance level feature vector is obtained as the weighted sum of the frame vectors $h_t$ with the attention weights:

$$c = \sum_{t=1}^{T} \alpha_t h_t.$$

2.3.3.2 Statistics pooling.
Statistics pooling was first introduced in [43] for extracting utterance level statistics from frame-level feature embeddings generated using a Time Delay Neural Network for speaker verification tasks. In our proposed architecture, statistics pooling is employed to extract bag level statistics (mean and standard deviation) from the utterance level attentive feature vectors. As such, the output of the statistics pooling layer aggregates the relevant discriminative information obtained from several speaker utterances and provides a unified feature for further classification. Given a set of attention feature vectors $C = (c_1, c_2, \ldots, c_{I_1})$ with $c_i \in \mathbb{R}^{I_2}$, obtained as described in section 2.3.3.1, where $I_1$ represents the number of utterances in the bag, statistics pooling computes the elementwise mean $\mu = \frac{1}{I_1} \sum_{i=1}^{I_1} c_i$ and the elementwise standard deviation $\sigma$ (the square root of the variance) of the vectors in $C$. This results in a pooled feature vector of dimension $\mathbb{R}^{2 \times I_2}$, with $\mu$ and $\sigma$ concatenated for each entry of $c$.
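The attention and statistics pooling steps can be sketched together in NumPy. This is only an illustration under stated assumptions: the frame scores are taken as a dot product with a single scoring vector `w` (a hypothetical trainable parameter; the exact scoring function of [47] is not reproduced here), the standard deviation uses the population normalization, and the feature tensor is random data with the 100 × 180 slice size reported for the feature tensor.

```python
import numpy as np

def attention_pool(H, w):
    """Collapse a (bands x frames) slice into one vector of length `bands`.

    H : (I2, T) array whose columns h_t are frame-level feature vectors.
    w : (I2,) scoring vector (hypothetical parameter, assumed for this sketch).
    """
    scores = w @ H                                   # one scalar score per frame
    scores = scores - scores.max()                   # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()    # softmax over frames
    return H @ alpha, alpha                          # c = sum_t alpha_t h_t

def statistics_pool(C):
    """Concatenate the per-dimension mean and standard deviation over the bag."""
    return np.concatenate([C.mean(axis=0), C.std(axis=0)])

rng = np.random.default_rng(0)
I1, I2, I3 = 30, 100, 180                            # utterances, bands, frames
X = rng.standard_normal((I1, I2, I3))                # stand-in feature tensor
w = rng.standard_normal(I2)

C = np.stack([attention_pool(X[i], w)[0] for i in range(I1)])   # (30, 100)
pooled = statistics_pool(C)                                      # (200,)
print(C.shape, pooled.shape)
```

The pooled vector has twice the number of bands, so its length does not depend on how many utterances the bag contains.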

Fully connected layer.
The output from the statistics pooling layer aggregates information across several utterances of a speaker. The pooled feature vector is passed to a fully connected network with two layers to reduce the dimensionality and extract additional high-level features. Finally, the output of the fully connected layers is passed on to the last layer with sigmoid activation to generate the classification probability of being depressed/normal.

Experimental setting.
The four architectures (baseline CNN-MIL, TFNN-MIL, 3D TFNN, and 3D TFNN+Attention) are evaluated on the DAIC-WOZ dataset for depression classification. For tensor formation, a bag of utterances is selected from each speaker, with the bag size chosen in the range [10,60]. Multiple tensors are thus formed for each speaker, since the chosen bag size yields multiple bags without repetition of utterances. During training, each individual bag of utterances is treated as coming from a new speaker bearing the same label as all the other child bags of the parent speaker, thereby generating a large number of tensors for training. During testing, however, the label for the parent speaker is calculated by averaging the predicted probabilities of all the child bags and comparing the averaged probability against a threshold. The threshold is calculated from the ROC curve generated using the validation data.
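The bag formation and speaker-level decision rule can be sketched as follows. The function names, the shuffling, the seed, and the choice to drop a trailing remainder smaller than the bag size are all assumptions made for illustration; the probabilities and threshold in the usage lines are made-up values.

```python
import random

def make_bags(utterances, bag_size, seed=0):
    """Split a speaker's utterances into disjoint bags of `bag_size`.

    Utterances are shuffled once and never repeated across bags; a trailing
    remainder smaller than `bag_size` is dropped (an assumption).
    """
    items = list(utterances)
    random.Random(seed).shuffle(items)
    n_bags = len(items) // bag_size
    return [items[i * bag_size:(i + 1) * bag_size] for i in range(n_bags)]

def speaker_label(bag_probs, threshold):
    """Average the child-bag probabilities and compare against a threshold."""
    mean_prob = sum(bag_probs) / len(bag_probs)
    return int(mean_prob >= threshold), mean_prob

# A hypothetical speaker with 100 utterance segments and a bag size of 30.
bags = make_bags(range(100), bag_size=30)            # 3 bags, 10 segments unused
label, prob = speaker_label([0.62, 0.41, 0.70], threshold=0.55)
print(len(bags), label, round(prob, 4))
```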
Mel-spectrograms are computed from the speech segments to be used as input for the Tensor Factorized Neural Network and the baseline CNN architecture. For the computation of mel-spectrograms, the speech segments are first windowed using a Hamming window of 2048 samples with a shift of 512 samples. The windowed signal is used to compute the Short-Time Fourier Transform (STFT). The magnitude spectrogram obtained from the STFT is then passed through a mel-scale filterbank to obtain the filterbank energies. A log operator is finally applied to get the log-mel spectrogram.
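The framing parameters determine the time dimension of each spectrogram. The short calculation below works this out for a 7 s segment at 16 kHz; the "centered" framing convention (padding half a window at each end, as several common STFT implementations do by default) is an assumption, shown alongside the count obtained without padding.

```python
SAMPLE_RATE = 16_000      # Hz
SEGMENT_SECONDS = 7       # fixed segment duration
WIN_LENGTH = 2048         # Hamming window length in samples (from the text)
HOP_LENGTH = 512          # window shift in samples (from the text)

n_samples = SAMPLE_RATE * SEGMENT_SECONDS             # 112000

# Assumption: the signal is padded by half a window on each side, so one
# frame is centered at every multiple of the hop length.
frames_centered = 1 + n_samples // HOP_LENGTH

# Without padding, only windows that fit entirely inside the signal count.
frames_unpadded = 1 + (n_samples - WIN_LENGTH) // HOP_LENGTH

window_ms = 1000 * WIN_LENGTH / SAMPLE_RATE           # window duration
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE              # shift duration

print(frames_centered, frames_unpadded, window_ms, hop_ms)
```

Under the centered convention the count comes to 219 frames, which agrees with the time dimension of the 128 × 219 input reported for the 3D TFNN.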
For the baseline CNN architecture, the number of filters in the first and second feature learning blocks is 64, with a kernel size of 3 × 3 and a shift of 1. For the third feature learning block, the number of filters is 128 with a kernel size of 2 × 2. The activation function used in all feature learning blocks is ELU, and max-pooling with a kernel size of 2 × 2 is used. The feature maps generated after the third feature learning block are vectorized and passed through a fully connected network with a sigmoid non-linearity in its last layer to generate probabilities for the depressed versus non-depressed categories.
For the TFNN-MIL system, the base architecture consists of four consecutive 2D Tensor Feed Forward layers. The feature dimensions produced by the Tensor FF layers are 120 × 210, 110 × 200, 100 × 180 and 80 × 160, respectively. The output from the fourth Tensor FF layer is used to calculate logits using an inner product with a weight tensor of dimensions 80 × 160. Finally, the logits are passed through the activation function to yield utterance segment-level probabilities. This base architecture is repeated for all the instances in the bag, and a final global average pooling of the probabilities generates the bag level probability.
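The shape bookkeeping of the TFNN-MIL base network can be traced with a small NumPy sketch. It assumes that each 2D Tensor FF layer multiplies the input by one factor matrix per mode and applies a nonlinearity; the ReLU activation, the omission of biases, the random initialization, and the 128 × 219 input size (taken from the 3D TFNN description) are assumptions of this example and may differ from the layer definition in [46].

```python
import numpy as np

def tensor_ff_2d(H, U_freq, U_time):
    """Project an (F x T) map to (F' x T') with one factor matrix per mode."""
    return np.maximum(U_freq.T @ H @ U_time, 0.0)      # ReLU is an assumption

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
dims = [(128, 219), (120, 210), (110, 200), (100, 180), (80, 160)]
factors = [(rng.standard_normal((f_in, f_out)) / np.sqrt(f_in),
            rng.standard_normal((t_in, t_out)) / np.sqrt(t_in))
           for (f_in, t_in), (f_out, t_out) in zip(dims[:-1], dims[1:])]
W = rng.standard_normal(dims[-1]) * 0.01               # weight tensor, 80 x 160

def segment_probability(mel):
    """Four Tensor FF layers, then inner product with W and a sigmoid."""
    H = mel
    for U_f, U_t in factors:
        H = tensor_ff_2d(H, U_f, U_t)
    return sigmoid(np.sum(H * W))

bag = rng.standard_normal((5, 128, 219))               # 5 random stand-in inputs
segment_probs = np.array([segment_probability(m) for m in bag])
bag_prob = segment_probs.mean()                        # global average pooling

# Parameter count of this sketch: two factor matrices per layer plus W.
n_params = sum(U_f.size + U_t.size for U_f, U_t in factors) + W.size
print(segment_probs.shape, n_params)
```

In this sketch the whole base network has about 0.2 million weights, because each layer stores two small factor matrices rather than a dense mapping between vectorized feature maps.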
For the 3D TFNN architecture, the input tensor is of size num_utter × 128 × 219, where the dimensions refer to the number of utterances, the number of mel filters, and the number of time frames, respectively. The input mel-spectrogram tensor is passed through two 3D tensor feed-forward layers, where the core tensors are of size num_utter × 120 × 200 and num_utter × 100 × 180, respectively. The activation function used in both Tensor FF layers is ReLU. The feature tensor obtained after the second Tensor FF layer is fed to a tensor sigmoid layer: the inner product of the feature tensor with a trainable weight tensor of the same size is passed through a sigmoid non-linearity to generate the class probability.
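A forward pass with the stated tensor sizes can be sketched in NumPy as a sequence of mode products. This is a shape-level illustration only: the layer is assumed to apply one factor matrix per mode (including a square matrix on the utterance mode, which keeps num_utter unchanged) followed by ReLU, biases are omitted, the weights are random, and the bag size of 10 is a hypothetical value.

```python
import numpy as np

def mode_n_product(tensor, matrix, mode):
    """Multiply all mode-`mode` fibers of `tensor` by `matrix`."""
    out = np.tensordot(matrix, tensor, axes=(1, mode))
    return np.moveaxis(out, 0, mode)

def tensor_ff_3d(X, factors):
    """One 3D tensor feed-forward layer: a factor matrix per mode, then ReLU."""
    H = X
    for mode, U in enumerate(factors):
        H = mode_n_product(H, U.T, mode)          # U has shape (I_n, P_n)
    return np.maximum(H, 0.0)

rng = np.random.default_rng(0)
num_utter = 10                                    # hypothetical bag size
mels = [rng.standard_normal((128, 219)) for _ in range(num_utter)]
X = np.stack(mels, axis=0)                        # (num_utter, 128, 219)

shapes = [(num_utter, 128, 219), (num_utter, 120, 200), (num_utter, 100, 180)]
layers = [[rng.standard_normal((i, p)) / np.sqrt(i) for i, p in zip(s_in, s_out)]
          for s_in, s_out in zip(shapes[:-1], shapes[1:])]

H = X
for factors in layers:
    H = tensor_ff_3d(H, factors)                  # ends at (num_utter, 100, 180)

W = rng.standard_normal(H.shape) * 0.01           # weight tensor, same size as H
bag_prob = 1.0 / (1.0 + np.exp(-np.sum(H * W)))   # tensor sigmoid layer
print(H.shape, float(bag_prob))
```

Because the utterance-mode factor mixes the slices, every entry of the output depends on all utterances in the bag, which is the sense in which a single bag-level probability is produced without pooling per-utterance labels.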
In the case of the 3D TFNN+Attention architecture, two 3D tensor FF layers, as used in the 3D TFNN architecture above, extract a discriminative feature tensor of size num_utter × 100 × 180. The utterance level attention mechanism generates utterance level feature vectors of dimensions num_utter × 100. This feature sequence is passed to a statistics pooling layer generating a feature vector in $\mathbb{R}^{200}$, which is passed through two fully connected layers of dimensions 256 and 256 and a last layer with a sigmoid non-linearity to generate the class probability for the bag of utterances.

Results
The four architectures (baseline CNN-MIL, TFNN-MIL, 3D TFNN, and 3D TFNN+Attention) are trained and evaluated on the DAIC-WOZ dataset using the following metrics: weighted accuracy (WA), unweighted accuracy (UA), and F1-score. Since the dataset is highly imbalanced, unweighted accuracy and F1-score are the apt choices to highlight the true prediction capability of the models. Moreover, another inherent issue with class-imbalanced datasets is that the default decision threshold of 0.5 for binary classification is no longer appropriate and has to be moved (threshold-moving). For our work, we have utilized the optimal threshold calculated from the ROC curve on the validation dataset, which is the development partition of the dataset. The optimal threshold is then used to generate labels from the probabilities predicted for the test set.
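The threshold selection and the three metrics can be illustrated on toy data. The text does not state which criterion defines the optimal ROC point, so the sketch below assumes Youden's J statistic (TPR minus FPR); the function names and all labels and probabilities are made up for illustration. WA is computed as overall accuracy, UA as the mean of the per-class recalls, and F1 for the depressed class.

```python
def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn

def youden_threshold(y_true, probs):
    """Candidate threshold maximizing TPR - FPR (an assumed criterion)."""
    best_t, best_j = 0.5, -1.0
    for t in sorted(set(probs)):
        tp, tn, fp, fn = confusion(y_true, [int(p >= t) for p in probs])
        j = tp / (tp + fn) - fp / (fp + tn)
        if j > best_j:
            best_t, best_j = t, j
    return best_t

def metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion(y_true, y_pred)
    wa = (tp + tn) / len(y_true)                     # overall accuracy
    ua = 0.5 * (tp / (tp + fn) + tn / (tn + fp))     # mean of class recalls
    f1 = 2 * tp / (2 * tp + fp + fn)                 # F1 of the positive class
    return wa, ua, f1

# Made-up development-set outputs used to pick the threshold.
dev_true = [0, 0, 0, 0, 0, 0, 1, 1]
dev_probs = [0.1, 0.2, 0.25, 0.3, 0.35, 0.45, 0.4, 0.6]
threshold = youden_threshold(dev_true, dev_probs)

# Made-up test-set outputs scored with the selected threshold.
test_true = [0, 0, 0, 0, 1, 1]
test_probs = [0.2, 0.3, 0.45, 0.1, 0.42, 0.35]
wa, ua, f1 = metrics(test_true, [int(p >= threshold) for p in test_probs])
print(threshold, round(wa, 4), ua, f1)
```

In this toy example the default threshold of 0.5 would label every test speaker as non-depressed, giving an F1-score of zero for the depressed class, whereas the moved threshold recovers one of the two positives.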
As seen in Table 2, the 3D TFNN and 3D TFNN+Attention architectures outperform the baseline CNN-MIL system by considerable margins of 16.67% and 17.2%, respectively, in terms of UA. This supports the view that Tensor Factorized Neural Networks are more suitable for MIL-based systems due to their capability to capture information that is common across the modes of the tensor input. Moreover, the 3D TFNN+Attention system provides a balance between the overall accuracy and the average of the class accuracies. This becomes important for imbalanced datasets, where the model's chances of fitting towards the majority class are always high. Moreover, in terms of F1-score, 3D TFNN outperforms the other techniques and reaches the state-of-the-art. Fig 3 presents the confusion matrices for the four architectures on the test set of the DAIC-WOZ dataset, taking 30 utterances per tensor. It is evident from the confusion matrix in Fig 3d that the 3D TFNN+Attention architecture can balance the model toward both the depressed and non-depressed categories, followed by the 3D TFNN architecture. This supports our proposal of using utterance level attention to generate attentive feature vectors per utterance segment. Moreover, the impact of the number of utterances per tensor on the recognition performance of the model is assessed in Fig 4. The number of utterances per tensor is varied in the interval [10,60]. The figure is plotted using B-spline interpolation [48] to compensate for the small number of data points and obtain a smooth curve. As is evident from the graph, the model performs best when 30 utterances are chosen per tensor. Also, the accuracy shows a gradual decline when the number of utterances per tensor is increased further. This may be because redundant information unrelated to the desired objective is also captured as the number of utterances increases, which leads to increased confusion and decreased accuracy.

Comparison with State-of-the-Art
Several studies have utilized the DAIC-WOZ depression dataset for unimodal as well as multimodal depression recognition [20,49]. Since in this investigation we have considered only the audio modality, the performance is compared with other studies using the audio modality only. Moreover, a few studies have reported final results only on the development partition of the dataset. Because our work treats the test set as the unseen data, we compare with similar works reporting results on the test partition. Also, to give a fair comparison, the published studies are grouped by the metrics used, and only those that report accuracy and F1-score are included. Table 3 presents the state-of-the-art techniques for depression recognition from speech utterances using the DAIC-WOZ dataset. Valstar et al. [40] provided the baseline results for the DAIC-WOZ dataset using both the audio and video modalities. Our novel implementation outperforms the baseline by 0.21 in mean F1-score for the audio modality scenario. Previously, Ma et al. [29] utilized a combination of CNN and LSTM networks to extract high-level features from raw speech representations and used a random sampling strategy to balance the examples between the depressed and normal classes. In contrast, our investigation uses a weighted loss function to alleviate the class imbalance and thereby incorporates all the training speakers during model training. As such, our proposed architecture achieves an overall performance gain of around 9% in terms of accuracy.
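One common way to realize a weighted loss for an imbalanced binary task is a class-weighted binary cross-entropy. The sketch below is an assumption about the general idea rather than the exact loss used in this work: the inverse-frequency weighting scheme, the function names, and the toy labels are all hypothetical.

```python
import math

def class_weights(labels):
    """Inverse-frequency weights, scaled so a balanced set gives 1.0 per class."""
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    return n / (2 * n_neg), n / (2 * n_pos)          # (w_neg, w_pos)

def weighted_bce(y_true, probs, w_neg, w_pos, eps=1e-12):
    """Mean binary cross-entropy with a separate weight for each class."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)                # avoid log(0)
        total += -(w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [0] * 8 + [1] * 2                           # toy 80/20 class split
w_neg, w_pos = class_weights(labels)                 # 0.625 and 2.5

# With uninformative predictions, both classes contribute equally to the loss.
loss = weighted_bce(labels, [0.5] * len(labels), w_neg, w_pos)
print(w_neg, w_pos, round(loss, 4))
```

Unlike random subsampling of the majority class, this keeps every training example in play and only changes how much each one contributes to the gradient.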

Discussion
Several features have been investigated in the literature for depression diagnosis from speech utterances. This study focused on mel-spectrograms for two reasons. First, mel-spectrograms have been shown to capture para-linguistic information present in speech utterances, such as emotional states [50], cough [51], etc. Secondly, spectrograms provide a natural 2D tensor form for speech utterances. The proposed tensor-based MIL techniques try to exploit the time-frequency information spread across several utterances of a speaker. The 3D TFNN extracts shared information across the mel-spectrograms of a speaker, thus trying to model the temporal information spread across multiple utterances in an interview setting. The 3D core tensor, which is the feature tensor, comprises the coefficients of interactions across the subspaces corresponding to each of the modes: the time subspace, the frequency subspace and the utterance subspace. Moreover, when using utterance-level attention, the model tries to extract more relevant information pertaining to depression from each utterance by means of self-attention. This in turn refines the feature extraction process by producing attentive feature vectors for each utterance in the tensor. To aggregate the information extracted using the attention layers, statistics pooling is used, which generates a combined feature vector for all the utterances in the tensor. The proposed techniques are computationally efficient, as using a Tensor Factorization based architecture significantly lowers the number of trainable parameters [46].

Conclusion
In this work, we present a tensor-based architecture for the task of Multiple Instance Learning when a collection of utterances for a speaker is available and inferences about the speaker label have to be drawn using the feature set from the utterances. Conventional MIL architectures, such as the baseline CNN-MIL system described in Fig 1, suffer from the inherent drawback of not considering relative and shared information across the utterances in a bag. These techniques rely on inferring labels for individual utterances and finally averaging or max-pooling the labels to infer the speaker-level labels. The tensor-based architectures address this problem by considering the utterances as the third mode in addition to the time and frequency modes in speech spectrograms. As such, TFNNs, by virtue of their rich mathematical framework, try to capture the shared information across the utterances of a bag by tensor factorization, where the input tensor is projected over three subspaces: the time subspace, the frequency subspace, and the utterance subspace. This helps to leverage the shared information and generate a single speaker/bag level probability for the specified task. To this end, we have implemented two tensor MIL architectures: 3D TFNN and 3D TFNN+Attention. Comparison with the state-of-the-art indicates that both these novel techniques are effective in capturing depression-related information across bags of utterances. Moreover, additional analysis of the optimal number of utterances per bag is also presented to shed light on the model performance when using varying bag sizes.